This project analyzes beer and brewery data. Exploratory data analysis is conducted on the data set to determine which US states dominate beer brewing. The data set contains missing values, and the analysis details how the researchers chose to handle them. The researchers explore relationships between alcohol by volume (ABV) and International Bitterness Units (IBU) in addition to geographical trends. Finally, the researchers build models that attempt to predict beer style from a beer’s ABV and IBU, and investigate other relationships in the beer data relating to ABV, IBU, and beer ratings.
Colorado and California have the most breweries. Each US region also has one or two states that lead in brewery count:
- Northwest: Oregon and Washington
- South: Texas
- Midwest: Michigan
- Northeast: Pennsylvania and Massachusetts
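The brewery ranking above comes down to counting distinct breweries per state. A minimal Python sketch of that count follows; the original analysis appears to have been done in R, and the rows and the function name `breweries_per_state` here are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical (brewery ID, state) pairs; not taken from the real data set.
breweries = [
    (1, "CO"), (2, "CO"), (3, "CA"), (4, "CA"), (5, "CO"),
    (6, "OR"), (7, "TX"), (8, "MI"),
]

def breweries_per_state(rows):
    """Count distinct breweries per state, largest count first."""
    unique = {(brew_id, state) for brew_id, state in rows}  # drop repeated rows
    counts = Counter(state for _, state in unique)
    return counts.most_common()

ranking = breweries_per_state(breweries)
print(ranking[:2])
```

Deduplicating on the brewery ID matters when the input has one row per beer rather than one row per brewery, as the merged table in this report does.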
## Brew_ID Name.Brewery City State Name.Beer Beer_ID ABV IBU
## 1 1 NorthGate Brewing Minneapolis MN Pumpion 2689 0.060 38
## 2 1 NorthGate Brewing Minneapolis MN Stronghold 2688 0.060 25
## 3 1 NorthGate Brewing Minneapolis MN Parapet ESB 2687 0.056 47
## 4 1 NorthGate Brewing Minneapolis MN Get Together 2692 0.045 50
## 5 1 NorthGate Brewing Minneapolis MN Maggie's Leap 2691 0.049 26
## 6 1 NorthGate Brewing Minneapolis MN Wall's End 2690 0.048 19
## Style Ounces
## 1 Pumpkin Ale 16
## 2 American Porter 16
## 3 Extra Special / Strong Bitter (ESB) 16
## 4 American IPA 16
## 5 Milk / Sweet Stout 16
## 6 English Brown Ale 16
## Brew_ID Name.Brewery City State
## 2405 556 Ukiah Brewing Company Ukiah CA
## 2406 557 Butternuts Beer and Ale Garrattsville NY
## 2407 557 Butternuts Beer and Ale Garrattsville NY
## 2408 557 Butternuts Beer and Ale Garrattsville NY
## 2409 557 Butternuts Beer and Ale Garrattsville NY
## 2410 558 Sleeping Lady Brewing Company Anchorage AK
## Name.Beer Beer_ID ABV IBU Style Ounces
## 2405 Pilsner Ukiah 98 0.055 NA German Pilsener 12
## 2406 Porkslap Pale Ale 49 0.043 NA American Pale Ale (APA) 12
## 2407 Snapperhead IPA 51 0.068 NA American IPA 12
## 2408 Moo Thunder Stout 50 0.049 NA Milk / Sweet Stout 12
## 2409 Heinnieweisse Weissebier 52 0.049 NA Hefeweizen 12
## 2410 Urban Wilderness Pale Ale 30 0.049 NA English Pale Ale 12
There are 5 missing values for Style, 62 missing values for ABV, and 1,005 missing values for IBU.
Beer Style is a grouping/classification for beers that’s been established by brewers based on brewing traditions and their domain expertise. Beers within an established Beer Style tend to have more similar alcohol content and bitterness (low within group variation), whereas beers that differ in Style tend to have less similar ABV and IBUs (high between group variation). Based on this knowledge, median ABV and IBU values were calculated for all 100 Beer Styles. Missing ABV and IBU values were generally addressed by replacing NA’s with the matching Beer Style’s median values.
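The style-median imputation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the researchers’ code: the rows are hypothetical, and the helper names `style_medians` and `impute` are assumptions.

```python
from statistics import median

# Hypothetical rows: (style, ABV, IBU); None marks a missing value.
beers = [
    ("American IPA", 0.065, 60),
    ("American IPA", 0.070, 70),
    ("American IPA", 0.068, None),
    ("Hefeweizen", 0.049, 15),
    ("Hefeweizen", None, 13),
]

def style_medians(rows, col):
    """Median of column index `col` per style, ignoring missing values."""
    groups = {}
    for row in rows:
        if row[col] is not None:
            groups.setdefault(row[0], []).append(row[col])
    return {style: median(vals) for style, vals in groups.items()}

def impute(rows):
    """Replace missing ABV/IBU with the median for the beer's style."""
    abv_med = style_medians(rows, 1)
    ibu_med = style_medians(rows, 2)
    out = []
    for style, abv, ibu in rows:
        if abv is None:
            abv = abv_med.get(style)  # stays None if the style has no ABV data
        if ibu is None:
            ibu = ibu_med.get(style)  # stays None if the style has no IBU data
        out.append((style, abv, ibu))
    return out

imputed = impute(beers)
```

Values that are still `None` after this step belong to styles with no observed data at all and need a separate rule.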
2 of the 5 missing Style values were imputed based on the individual beer names. “OktoberFiesta” had a beer name, ABV, and IBU that were consistent with the other Oktoberfest style beers. Similarly, “Kilt Lifter Scottish-Style Ale” was consistent with the Scottish Ales style beers. The remaining 3 beers with missing Style values did not have enough information to classify their Style. For those 3, Style was left blank, and the missing IBU and ABV values were set to the overall median values of the entire data set.
55 of the 1,005 missing IBU values could not be imputed this way because their Styles had no IBU values at all. These 55 beers belonged to 10 unique Styles. Ciders, Meads, Shandies, and Rauchbiers (smoked beers) are typically made with little or no hop bittering, so their missing IBU values were set to 0 based on this domain knowledge. There were 2 exceptions where the product names suggested that hops may have been added, contrary to style conventions: the Cider “Nunica Pine” and the Mead “Nectar of Hops”. For these 2, the missing IBU values were set to the overall median IBU value for the entire data set.
Finally, the remaining 11 missing IBU values were set to the overall median IBU value for the entire data set.
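The order of the fallback rules for a missing IBU (style median first, then 0 for styles treated as hop-free, then the overall median) can be written as a single function. The Python sketch below is illustrative; the style labels in `HOP_FREE_STYLES` are simplified placeholders rather than the exact labels in the data set, and the function name is an assumption.

```python
# Simplified, hypothetical style labels for the styles the text treats as hop-free.
HOP_FREE_STYLES = {"Cider", "Mead", "Shandy", "Rauchbier"}
# Product names that hint at added hops despite the style convention (from the text).
HOPPED_EXCEPTIONS = {"Nunica Pine", "Nectar of Hops"}

def fallback_ibu(name, style, style_median, overall_median):
    """Choose an IBU for a beer whose own IBU value is missing."""
    if style_median is not None:
        return style_median      # normal case: the style has observed IBU values
    if style in HOP_FREE_STYLES and name not in HOPPED_EXCEPTIONS:
        return 0                 # treated as unhopped by convention
    return overall_median        # last resort: median of the whole data set
```

For example, `fallback_ibu("Nunica Pine", "Cider", None, 35)` returns the overall median (35 is a made-up value here) because the name is listed as an exception.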
All further analyses were done both by excluding the missing values and by using the imputed values.
Utah has the lowest median ABV, likely due to its strict state alcohol laws and regulations. Kentucky and Washington DC have the highest median ABV.
Replacing missing values for ABV barely changes the trend. This is not surprising since only 2-3% of the ABV values are missing. Delaware does get a notable bump in the trend.
Replacing missing values for IBU makes a big difference! There are big swings in the trends after imputing roughly 40% of the IBU values based on beer styles. The Northeastern states have the most notable shifts: Maine, New Hampshire, and Vermont. Some of the lower values for NH are driven by missing data for sour beers, which have very low IBUs. The higher values for VT are driven by hoppier Pale Ales and IPAs that may be part of the trending “Juicy/Hazy” New England style IPAs.
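The comparison between excluding and imputing missing values amounts to computing each state’s median twice and looking at the difference. The Python sketch below shows the idea on made-up rows; the numbers are not from the data set and the function name `median_by_state` is an assumption.

```python
from statistics import median

# Hypothetical rows: (state, observed IBU or None, IBU after imputation).
rows = [
    ("VT", 20, 20), ("VT", None, 70), ("VT", None, 65),
    ("UT", 30, 30), ("UT", 35, 35), ("UT", None, 32),
]

def median_by_state(rows, use_imputed):
    """Median IBU per state, either dropping missing values or using imputed ones."""
    groups = {}
    for state, observed, imputed in rows:
        value = imputed if use_imputed else observed
        if value is not None:
            groups.setdefault(state, []).append(value)
    return {state: median(vals) for state, vals in groups.items()}

excluded = median_by_state(rows, use_imputed=False)
with_imputed = median_by_state(rows, use_imputed=True)

# Positive shift: the state's median rises once imputed values are included.
shift = {state: with_imputed[state] - excluded[state] for state in excluded}
```

States with few observed values and many imputed ones, like the hypothetical VT rows here, show the largest shifts.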
Colorado has the beer with the highest alcohol content: “Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale” by Upslope Brewing Company, at 12.8% ABV.
Oregon has the most bitter beer: “Bitter Bitch Imperial IPA” by Astoria Brewing Company, at 138 IBU.
## Style Brew_ID Name.Brewery City State
## 2170 Quadrupel (Quad) 52 Upslope Brewing Company Boulder CO
## Name.Beer Beer_ID ABV IBU
## 2170 Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale 2565 0.128 NA
## Ounces median.ABV count.x ABV1 median.IBU count.y IBU1
## 2170 19.2 0.099 4 0.128 24 1 24
## Style Brew_ID Name.Brewery City
## 518 American Double / Imperial IPA 375 Astoria Brewing Company Astoria
## State Name.Beer Beer_ID ABV IBU Ounces median.ABV count.x
## 518 OR Bitter Bitch Imperial IPA 980 0.082 138 12 0.087 103
## ABV1 median.IBU count.y IBU1
## 518 0.082 91 75 138
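Finding the strongest and most bitter beers is a maximum over rows that skips missing values. A minimal Python sketch, with placeholder beer names and a hypothetical helper `top_by` (only the 0.128 ABV and 138 IBU figures echo the output above):

```python
# Hypothetical rows: (beer name, state, ABV, IBU); None marks a missing value.
beers = [
    ("Beer A", "CO", 0.128, None),
    ("Beer B", "OR", 0.082, 138),
    ("Beer C", "MN", 0.060, 38),
]

def top_by(rows, col):
    """Row with the largest non-missing value in column index `col`."""
    candidates = [row for row in rows if row[col] is not None]
    return max(candidates, key=lambda row: row[col])

strongest = top_by(beers, 2)    # highest ABV
most_bitter = top_by(beers, 3)  # highest IBU
```

Filtering out `None` first matters because the beer with the highest ABV can itself have a missing IBU, as the output above shows.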
There appears to be an approximately linear relationship between %ABV and IBU that may have some curvature. This positively correlated relationship is likely because people like drinking balanced beers. Higher alcohol beers tend to also be maltier/sweeter (higher residual sugar) which balances the high bitterness (high IBUs).
There appears to be a boundary near ABV=10% that most beers don’t cross. This may be due to limitations on product cost, beer yeast survival at higher ABV, or state/federal regulations on beer that is 10% ABV or more.
The imputed IBU values are evident from the vertical bands at IBU=0, IBU=median(IBU), etc.
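One way to put a number on the ABV vs. IBU relationship is a Pearson correlation together with a least-squares line. The Python sketch below uses made-up points and hypothetical function names; it fits a straight line only, whereas the plots in this report use a smoothed (GAM) fit that can follow curvature.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ols_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical (IBU, ABV %) pairs; not taken from the data set.
ibu = [10, 20, 40, 60, 90]
abv = [4.5, 5.2, 5.8, 7.1, 8.4]

r = pearson_r(ibu, abv)
slope, intercept = ols_line(ibu, abv)
```

Imputed IBU values pile up at a few fixed numbers, so it is worth running a check like this both with and without them.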
## Warning: Removed 1005 rows containing non-finite values (stat_sum).
## Warning: Removed 1005 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
First, find beers with “IPA” or “Ale” directly in their name for the training set (ex: Ranger IPA). Second, find IPAs and Ales without those identifiers in their names for the test set (ex: .
Using imputed values for the missing data, the k-nearest neighbors (kNN) model distinguishes IPAs from Ales reasonably well, with 90% accuracy.
There is a trade-off between Sensitivity and Specificity with the kNN model. k=5 gives higher sensitivity (classifies fewer Ales incorrectly as IPAs), but k=11 gives higher specificity (classifies fewer IPAs incorrectly as Ales).
In general, the model fit with the missing values excluded performed worse than the model fit with imputed values.
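The kNN classification and the sensitivity/specificity trade-off can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the data points are invented, “Ale” is taken as the positive class (which matches the description of sensitivity above), and the features are left unscaled for brevity, whereas a real analysis would usually standardize ABV and IBU first.

```python
from collections import Counter
from math import dist

# Hypothetical ((ABV %, IBU), label) points; not taken from the data set.
training = [
    ((5.0, 30), "Ale"), ((5.5, 35), "Ale"), ((4.8, 25), "Ale"),
    ((6.8, 65), "IPA"), ((7.0, 70), "IPA"), ((6.5, 60), "IPA"),
]
held_out = [((5.2, 32), "Ale"), ((6.9, 68), "IPA"), ((6.4, 58), "Ale")]

def knn_predict(train, point, k):
    """Majority label among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def sensitivity_specificity(actual, predicted, positive):
    """Sensitivity and specificity with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

actual = [label for _, label in held_out]
predicted = [knn_predict(training, point, k=3) for point, _ in held_out]
sens, spec = sensitivity_specificity(actual, predicted, positive="Ale")
```

Rerunning the last three lines with different values of `k` is a simple way to see how the two measures move against each other.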
IPAs are generally (but not always) hoppier and boozier than Pale Ales. American Double / Imperial IPAs are pushing this trend to new limits. Belgian Strong Pale Ales are outliers and may be more similar to Belgian Strong Ales rather than Pale Ales.
## Pale Ale White IPA Strong Pale Ale
## 281 11 7
## IPA Double / Imperial IPA
## 455 105
## Warning: Removed 299 rows containing non-finite values (stat_smooth).
## Warning: Removed 299 rows containing missing values (geom_point).
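A deeper dive into alcohol content and beer ratings used a beer review data set from Kaggle. Yes, people like boozier beers: mean ABV differs significantly across the five groups compared in a one-way ANOVA (F = 6837, p < 2e-16), and Tukey comparisons show that every pair of groups differs significantly except groups 1 and 2. The correlation between ABV and overall review score is positive but weak (r = 0.14).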
## Warning: Removed 67785 rows containing non-finite values (stat_ydensity).
## Df Sum Sq Mean Sq F value Pr(>F)
## beer_reviews$group 4 144907 36227 6837 <2e-16 ***
## Residuals 1518817 8047779 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 67792 observations deleted due to missingness
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = beer_reviews$beer_abv ~ beer_reviews$group)
##
## $`beer_reviews$group`
## diff lwr upr p adj
## 2-1 0.02771935 -0.01927193 0.07471063 0.4914803
## 3-1 0.39868193 0.35553895 0.44182491 0.0000000
## 4-1 0.90598781 0.86335694 0.94861869 0.0000000
## 5-1 1.12213589 1.07506996 1.16920183 0.0000000
## 3-2 0.37096258 0.34805164 0.39387352 0.0000000
## 4-2 0.87826846 0.85633708 0.90019984 0.0000000
## 5-2 1.09441654 1.06477205 1.12406104 0.0000000
## 4-3 0.50730588 0.49572476 0.51888700 0.0000000
## 5-3 0.72345396 0.70039029 0.74651764 0.0000000
## 5-4 0.21614808 0.19405719 0.23823897 0.0000000
##
## Pearson's product-moment correlation
##
## data: beer_reviews$beer_abv and beer_reviews$review_overall
## t = 172.34, df = 1518820, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1369288 0.1400485
## sample estimates:
## cor
## 0.138489
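The F value in the ANOVA table above is the ratio of the between-group mean square to the residual mean square (36227 / (8047779 / 1518817) ≈ 6837). The Python sketch below computes the same statistic from raw groups; the function name and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy ABV-like values for three hypothetical groups.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

With about 1.5 million observations, even small differences in group means give a very large F, which is why the weak correlation and the tiny p-value can both be true.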
In conclusion, there is a wealth of information in this data set on Beers, Breweries, States, and beer characteristics (Style, IBU, ABV, etc.):
- There are dominant brewing states in each region of the US.
- Missing IBU data may be important for assessing emerging regional trends.
- Alcohol (ABV) content is shifting higher driven by demand for bigger, boozier beers.
- People generally like drinking beers that are balanced for alcohol/sweetness vs. bitterness.
- Some beer styles can be reliably identified based only on the IBU and ABV content.
6. Comment on the summary statistics and distribution of the ABV variable
ABV values range from 0.1% to 12.8% with a median of 5.6% and a mean of 6.0%. The higher mean vs. the median indicates the distribution is right-skewed, and the histogram plot visually confirms this. This right-skew may indicate a shift in the beer market towards “bigger” high alcohol beers.
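The mean-versus-median comparison used above is a quick heuristic for the direction of skew, not a formal test. A small Python sketch on made-up ABV percentages (the values and the function name `skew_direction` are hypothetical):

```python
from statistics import mean, median

def skew_direction(values):
    """Heuristic skew direction based on whether the mean exceeds the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "none"

# Hypothetical ABV percentages with one strong beer in the upper tail.
abv_sample = [4.5, 5.0, 5.2, 5.6, 5.8, 6.5, 9.9]
direction = skew_direction(abv_sample)
```

A histogram remains the more reliable check, since the heuristic can mislead for multimodal or discrete data.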